Incident Response
If you are not a member of the Mobile Platform team and need to report a mobile incident:
post a message in the #va-mobile-app channel, tagging @mobile-incident-response.
For members of the Core Mobile Platform team:
This document outlines the process for responding to critical VAHB incidents reported during business hours (Monday - Friday, 9am ET - 5pm PT).
For more explicit guidance around recurring issues, refer to Mobile Plays and Postmortems.
Step one - Acknowledge the incident
Regardless of the method by which the incident has been detected, take the following steps to acknowledge the incident and bring the incident response crew together.
- Acknowledge the issue in #va-mobile-app if it has not already been posted and tag @mobile-incident-response
- Identify the Incident Commander (Default to Michelle Middaugh, Backup: Ryan Thurlwell). This person will be the primary point of contact and will be responsible for keeping Leadership updated with the current status of an active incident at least every hour with the following:
- The current state of the service
- Remediation steps taken
- Any new findings since the last update
- Theory eliminations (i.e. ‘What have we determined is not the cause?’)
- Anticipated next steps
- ETA for the next update (if possible)
- Identify the lead engineer (Default to Jon Bindbeutel, Backup: John Marchi)
- The IC creates a temporary Slack channel using conventional naming such as 010826-mobile-incident, tagging @mobile-incident-response. This channel will be dedicated to incident conversations and can be closed following the post mortem and retro.
- Return to the original incident post in #va-mobile-app and thread the name of the Incident Commander and a link to the incident channel
- Open an incident room/bridge call and post the link in the temp channel
Step two - Determine impact
If it is a widespread outage affecting both web and mobile:
- Acknowledge the issue in #va-mobile-app and tag @mobile-incident-response
- Check the #oncall and #vfs-platform-support channels. If OCTO or Platform staff are aware of and addressing the incident, stand by and monitor for escalation of any mobile-specific questions. Report the incident if no post exists.
If the incident is an external outage:
- Acknowledge the issue in #va-mobile-app and tag @mobile-incident-response and the relevant Experience team(s)
- Check the #oncall and #vfs-platform-support channels. If OCTO or Platform staff are aware of and addressing the incident, stand by and monitor for escalation of any mobile-specific questions. Report the incident if no post exists.
If the incident affects mobile only, first assess whether it is a security incident. Security incidents are prioritized as Critical and must be escalated as described below.
- Has the system been compromised by a third party?
- Has there been a leak of personally identifiable information?
- Is the system under attack?
- If yes to any of the above, the incident is considered a critical security incident. Immediately inform VA Product Owners, who will escalate the incident to Tier 3 following the steps in the Security Incidents documentation. Critical protocol applies.
If the incident affects mobile only and is not a security incident: determine the impact using the matrix below, with consideration for the user experience. An incident may be more severe if the app does not respond gracefully or present appropriate error messaging.
| Incident level | Qualifications for level | Escalation steps |
|---|---|---|
| Critical incident | | |
| Priority/Major incident | | |
| Moderate/Medium incident | | As a rule, no escalation is needed for medium incidents, but the Incident Commander can choose to escalate at their discretion. |
| Minor/Low incident | | As a rule, no escalation is needed for minor incidents, but the Incident Commander can choose to escalate at their discretion. |
Additional Incident Matrix documentation: VA Major Incident Management Team Matrix
Step three - Communicate with Veterans and Stakeholders
In addition to the Incident Commander's communication requirements, consider others you may need to keep informed, including:
Veterans: Use appropriate tooling and communication channels to ensure Veterans are aware of the issue as necessary and do not spend time doing work that will be lost. This may include:
- Adding an availability framework message (FE or BE)
- Disabling a given feature via remote config (see the sketch below)
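For example, if the affected feature sits behind a remote config flag, it can be switched off without an app release. The following is a minimal sketch in TypeScript, assuming flags are read through react-native-firebase's Remote Config module; the flag key `appointments_enabled` and the helper name are hypothetical, not the app's actual configuration.

```typescript
import remoteConfig from '@react-native-firebase/remote-config';

// Hypothetical flag key; real keys live in the app's remote config project.
const APPOINTMENTS_FLAG = 'appointments_enabled';

// Returns whether the feature should be shown, preferring the value
// published in the remote config console over the in-app default.
export async function isAppointmentsEnabled(): Promise<boolean> {
  // In-app default so the check still resolves if the fetch fails.
  await remoteConfig().setDefaults({ [APPOINTMENTS_FLAG]: true });

  // Pull and activate the latest published values. During an incident,
  // flipping the flag in the console disables the feature on the next fetch.
  await remoteConfig().fetchAndActivate();

  return remoteConfig().getValue(APPOINTMENTS_FLAG).asBoolean();
}
```

The app-side check only controls how quickly clients pick up the change; the actual kill switch is flipped in the remote config console by whoever holds access during the incident.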
Stakeholders: Ensure that your VA points of contact are informed and aware of the issue, its impact, and the expected resolution timeline. Include a link to relevant issues and/or Slack conversations so that they can keep up to date on progress.
Step four - Diagnosis and determine path to resolution
- Determine the root cause
- Go wide, then go deep. First verify the overall state of the system. Is it a particular API endpoint that is unavailable, or the whole API? Isolate what parts of the system are actually affected.
- Rule out external factors. Given the nature of vets-api as a facade over other VA systems, check whether any relevant upstream dependencies are unavailable or experiencing elevated error rates. In that case, the likely response is to notify the team responsible for the upstream system and, if possible, set a maintenance window in PagerDuty to trigger the downtime notification mechanism (see the sketch after this list).
- Don't forget about non-API dependencies such as the VA network gateways that sit in front of our backend services. If these are experiencing an issue it's almost certainly also affecting VA.gov and possibly the VA as a whole.
- Look for what changed. If a behavior began suddenly, determine if anything changed recently that can account for it - did an API deployment occur? Did the Web Platform operations team change some infrastructure?
- Look inward. Is there something we did that caused the issue? Sometimes certs expire or upstream terms of service need to be accepted without our knowledge.
- Refer to Triaging an Incident steps as necessary
- The Incident Commander should capture notes, discussions, and other items (screenshots, log messages, etc.) that can act as part of the record of the incident for later reporting to stakeholders, and to assist the team when a retrospective is conducted.
- Consider the amount of time until the next natural release
- Recommend a mitigation and execute it where applicable
- The Incident Commander continues to communicate the current state and progress in the incident thread with relevant details
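When the diagnosis points to an upstream outage (see the external-factors step above), a PagerDuty maintenance window can be set on the affected service to trigger the downtime notification mechanism and suppress redundant alerts. Below is a minimal sketch against the public PagerDuty REST API; the API token, service ID, requester email, and one-hour duration are placeholder assumptions, and the team's actual tooling or permissions may differ.

```typescript
// Minimal sketch: open a one-hour PagerDuty maintenance window for one service.
// PAGERDUTY_TOKEN, SERVICE_ID, and the From email are placeholders.
const PAGERDUTY_TOKEN = process.env.PAGERDUTY_TOKEN ?? '';
const SERVICE_ID = 'PXXXXXX'; // hypothetical PagerDuty service ID

export async function createMaintenanceWindow(description: string): Promise<void> {
  const start = new Date();
  const end = new Date(start.getTime() + 60 * 60 * 1000); // one hour from now

  const response = await fetch('https://api.pagerduty.com/maintenance_windows', {
    method: 'POST',
    headers: {
      Authorization: `Token token=${PAGERDUTY_TOKEN}`,
      Accept: 'application/vnd.pagerduty+json;version=2',
      'Content-Type': 'application/json',
      From: 'incident-commander@example.com', // requester email required by the API
    },
    body: JSON.stringify({
      maintenance_window: {
        type: 'maintenance_window',
        start_time: start.toISOString(),
        end_time: end.toISOString(),
        description,
        services: [{ id: SERVICE_ID, type: 'service_reference' }],
      },
    }),
  });

  if (!response.ok) {
    throw new Error(`PagerDuty maintenance window request failed: ${response.status}`);
  }
}
```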
Step Five - After incident
A post-mortem report and retro are required for incidents that:
- are determined to be Critical or High priority, including those which involve the security of a system or a Veteran's data
- require the use of a playbook
- impact a significant portion of the userbase
- persist for a significant period of time
- require out-of-app coordination or an out-of-band app release
- prompt a request from OCTO leadership
Post Mortems: The goal of the postmortem is not to assign blame, but to improve our ability to prevent, detect, and respond to future incidents. Possible follow-up actions include adding additional monitoring, adding implementation safeguards, tuning alerts, adding documentation, and refining inter-team communication processes.
- Follow the instructions to create a postmortem document and get a draft up within 24 hours.
- Follow the instructions for the Incident Retrospective process
- Ensure that your team’s VA points of contact, as well as the VA Platform’s points of contact (Steve Albers and Erika Washburn), are aware of the incident resolution and given a chance to review the post-mortem
- If the incident is such that it may occur again in the future – or if it follows a theme common to incidents in the past – add the incident and the steps to resolve it as a Mobile Incident Response Play.